Jupyter is a way to combine markdown documentation, code, graphics and data in an easy-to-read document that renders in a web browser. The notebook itself is stored as a text file in JSON format.
It is language agnostic. The name "Jupyter" combines Julia (a newer language for scientific computing), Python (which you know and love, or at least will by the time the course is over), and R (the dominant tool for statistical computation). However, kernels now exist for over 40 languages, so you are not limited to Julia, Python, and R.
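Since a notebook is just JSON, it can be inspected or even generated programmatically. Below is a minimal sketch of the file format, building a bare-bones notebook dict by hand; the exact schema is defined by the nbformat specification, and this skeleton only illustrates the general shape:

```python
import json

# A minimal notebook skeleton: plain JSON with metadata and a list of cells
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hello from JSON"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hi')"],
        },
    ],
}

text = json.dumps(nb, indent=1)
print(text[:40])
```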
# Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Plot.ly Imports
import plotly as plty
import plotly.graph_objs as go
import cufflinks as cf
cf.set_config_file(offline=True, theme='ggplot', offline_link_text=None, offline_show_link=False)
# Seaborn Imports
import seaborn as sns
sns.set_style('whitegrid')
import sys
sys.version
# Creating a data frame from dictionaries (each dict is a column)
d = {
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c']
}
d = pd.DataFrame(d)
d
d.info()
d.describe().T
# Creating a data frame from tuples (each tuple is a row)
d = [
    (1, 'a'),
    (2, 'b'),
    (3, 'c')
]
d = pd.DataFrame(d, columns=['x', 'y'])
d
type(d)
d['x']
type(d['x'])
pd.Series([1, 2, 3])
pd.Series([1, 2, np.nan])
pd.Series([1, np.nan, 'x'])
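Note how the dtype changes across the three Series above: integers alone keep an integer dtype, a NaN forces an upcast to float (NaN has no integer representation), and mixing numbers with strings falls back to the generic object dtype. Checking this explicitly:

```python
import numpy as np
import pandas as pd

# Integers alone keep an integer dtype
assert pd.Series([1, 2, 3]).dtype == np.dtype('int64')
# NaN cannot be represented as an integer, so the whole Series becomes float
assert pd.Series([1, 2, np.nan]).dtype == np.dtype('float64')
# Mixing numbers and strings falls back to the generic object dtype
assert pd.Series([1, np.nan, 'x']).dtype == np.dtype('object')
```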
%%bash
cat << EOF > /tmp/testdata.csv
x y
1 a
2 b
3 c
EOF
!cat /tmp/testdata.csv
pd.read_csv('/tmp/testdata.csv', sep=' ')
%%time
import time
time.sleep(1)
%%timeit
import time
time.sleep(.1)
%load_ext rpy2.ipython
%%R
x <- rnorm(1000) # Create random data
plot(density(x)) # Plot density
Both data frames and series always have an "index". Sometimes that index is meaningful and sometimes it's not (it can just be the row number). Either way, indexes are an important aspect of many operations in pandas and they also open a lot of doors.
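One concrete door an index opens is automatic alignment: arithmetic between two Series matches rows by index label, not by position. A small illustrative sketch (toy data, invented for this example):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Addition aligns on the index labels, not on row order,
# so 'a' pairs with 'a' (1 + 30) even though the rows are in different orders
total = s1 + s2
print(total)
```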
d = pd.DataFrame(np.random.randn(10, 3),
                 columns=['x', 'y', 'z'],
                 index=['Row {}'.format(i) for i in range(10)])
d.index.name = 'Row'
print(d.shape)
d
d = d.reset_index()
d
d = d.set_index(['Row', 'x'])
d
d.columns
d.index
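With labels in the index, rows can be selected by name rather than position via .loc. A self-contained sketch on a fresh toy frame (so it does not depend on the state above):

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3], 'y': [10.0, 20.0, 30.0]},
                 index=['Row 0', 'Row 1', 'Row 2'])

# Label-based selection with .loc (position-based selection would use .iloc)
row = d.loc['Row 1']
print(row['y'])

# Slices on .loc are label-based and, unlike Python slices, include both endpoints
sub = d.loc['Row 0':'Row 1']
print(len(sub))
```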
Pandas has a huge number of built-in, well optimized (i.e. cythonized) commands for doing vectorized operations as part of grouping, sorting, aggregating, filtering, pivoting, etc. workflows. Here are a few examples:
# Getting ahead of things here, but scikit-learn has a lot of built-in datasets for this sort of testing
from sklearn import datasets
# Load in the famous Iris dataset
iris = datasets.load_iris()
Scikit-learn datasets come as numpy arrays, so knowing how to convert those to pandas data frames is a good skill to have (it comes up in plenty of other places too).
iris.data[:5], np.unique(iris.target)
iris.feature_names, iris.target_names
# Map target values to the corresponding names
iris.target_names[iris.target][:6]
# Create a data frame containing only the features first
d = pd.DataFrame(iris.data, columns=iris.feature_names)
d.head()
# Now add in the "Target" value which in this case is the species of the flower
d = d.assign(species=iris.target_names[iris.target])
# This would work just as well, but the .assign method is good to know for chaining operations together
# d['species'] = iris.target_names[iris.target]
d.head()
# The data frame still maintains a numpy matrix under the hood, and you can get at it with .values
d.values[:5]
Having "(cm)" in every column is annoying, so they can easily be renamed through functional utilities:
d = d.rename(columns=lambda c: c.replace(' (cm)', ''))
# Alternative (but generally worse methods)
# d.columns = ['sepal length', 'sepal width', 'petal length', 'species']
# d.columns = [c.replace(' (cm)', '') for c in d]
d.head()
d[['sepal length', 'species']].head()
# Alternatively (and better)
d.filter(items=['sepal length', 'species']).head()
# Or
d.filter(regex='sepal length|species').head()
d[d['species'] == 'setosa'].head(2)
d[(d['species'] == 'setosa') & (d['petal length'] > 1.4)].head(2)
d.query("species == 'setosa'").head(2)
d.loc[lambda x: x['species'] == 'setosa'].head(2)
d.query("species == 'setosa' and petal length > 1.4").head(2)
d.groupby('species').mean()
d.groupby('species').describe().head(10)
d.groupby('species')['petal length'].mean()
# Passing a dict of functions to a SeriesGroupBy was removed in newer pandas; pass a list instead
d.groupby('species')['petal length'].agg(['mean', 'median'])
d.groupby('species')[['petal length']].mean()
# Apply a function of some kind to each numeric column in the frame
d.select_dtypes(include=[np.number]).apply(lambda col: col > col.mean()).head()
# Apply custom logic per group
# First, let's create a new variable by binning the sepal length value
d = d.assign(sepal_length_bucket=lambda x: pd.cut(x['sepal length'], bins=5))
d.head()
# Now get the most commonly occurring species amongst each sepal length bin
d.groupby('sepal_length_bucket').apply(lambda g: g['species'].value_counts().idxmax())
d['petal length'].describe()
d.sort_values('petal length').head(3)
d.sort_values('petal length', ascending=False).head(3)
IMO this is the hardest thing to conceptualize in data analysis, but it comes up very frequently. "Reshaping" data means moving it between wide and long formats effectively, usually in an effort to reach a "tidy" representation. It's hard to imagine a better tool for that than Pandas.
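As a warm-up before the real dataset, here is what "wide" vs "long" means on a toy frame (column names invented purely for illustration):

```python
import pandas as pd

# Wide: one row per subject, one column per measurement
d_wide = pd.DataFrame({'id': [1, 2], 'height': [170, 180], 'weight': [70, 80]})

# Long: one row per (subject, measurement) pair -- melt moves the
# measurement columns down into rows
d_long = d_wide.melt(id_vars='id', var_name='measure', value_name='value')
print(d_long.shape)
```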
# Load the "Boston Housing Price" dataset so we have something practical to work with
# (note: load_boston was removed in scikit-learn 1.2, so this requires an older version)
bost = datasets.load_boston()
print(bost['DESCR'])
d_bost = pd.DataFrame(bost.data, columns=bost.feature_names)\
    .assign(PRICE=bost.target)\
    .filter(items=['CRIM', 'RM', 'AGE', 'PRICE'])
d_bost.head()
d = d_bost.apply(pd.qcut, q=[0, .25, .5, .75, 1], labels=['Low', 'Med', 'High', 'Very High'])
d.head()
d.groupby(['PRICE', 'AGE']).size().head()
# Count houses by age and price group
d_pr_v_age = d.groupby(['PRICE', 'AGE']).size().unstack()
d_pr_v_age
d_pr_v_age.stack().head()
# Compute counts as percentages instead
(d_pr_v_age / d_pr_v_age.sum().sum()).mul(100).round(2)
(d_pr_v_age / d_pr_v_age.sum().sum()).mul(100).round(2)\
    .style.background_gradient(cmap='autumn_r')
# Count by age, price, and number of rooms
d_pvar = d.groupby(['PRICE', 'AGE', 'RM']).size().unstack().unstack()
d_pvar
# Highlighting differences in counts
plt.figure(figsize=(20, 4))
sns.set(font_scale=1.5)
sns.heatmap(d_pvar, annot=True)
d_hm = d.groupby(['PRICE', 'AGE', 'RM']).size()
d_hm = d_hm.rename('Count').reset_index()
def heatmap(data, **kwargs):
    d = data.pivot_table(index='PRICE', columns='AGE', values='Count', aggfunc='sum')
    return sns.heatmap(d, cbar=False, annot=True)
# (the size= parameter of FacetGrid was renamed to height= in newer seaborn)
sns.FacetGrid(d_hm, col='RM', height=5)\
    .map_dataframe(heatmap)\
    .set_xlabels('AGE')\
    .set_ylabels('PRICE')
Pandas also has two functions, pivot_table and melt that can accomplish many of the same things as stack and unstack.
The main difference between the two sets of commands then is that pivot_table/melt operate starting with unindexed data frames, or at least, data frames with indexes you don't care about. This makes them much more familiar for SAS/R users that might not be used to indexes being a formal part of data structures.
The stack/unstack commands, on the other hand, operate solely by moving data between row indexes and column indexes (or vice versa). They're much more specific to pandas (aka more "pandorable") because of this, but they are generally more intuitive once you get into a step-wise flow of operations.
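The equivalence between the two approaches is easy to verify on a toy frame: pivoting the unindexed data and unstacking the indexed version of the same data produce the same table. A minimal sketch (invented count data, just for illustration):

```python
import pandas as pd

counts = pd.DataFrame({
    'PRICE': ['Low', 'Low', 'High', 'High'],
    'AGE':   ['Low', 'High', 'Low', 'High'],
    'N':     [5, 3, 2, 8],
})

# pivot_table works on the unindexed frame directly
via_pivot = counts.pivot_table(index='PRICE', columns='AGE', values='N', aggfunc='sum')

# unstack needs the grouping variables moved into the index first
via_unstack = counts.set_index(['PRICE', 'AGE'])['N'].unstack()

# Same table either way
assert (via_pivot == via_unstack).all().all()
```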
d.head()
d_pr_v_age = d.pivot_table(index='PRICE', columns='AGE', values='RM', aggfunc='count')
d_pr_v_age
d_melt = d_pr_v_age.copy()
d_melt.columns = d_melt.columns.astype(str)
d_melt = d_melt.reset_index()
d_melt.head()
pd.melt(d_melt, id_vars=['PRICE'], value_name='COUNT')
There are 3 major aspects of visualization in Python (IMO): low-level Matplotlib, the higher-level static wrappers built on top of it (the pandas .plot methods and Seaborn), and interactive libraries like Plot.ly.
d = pd.DataFrame(bost.data, columns=bost.feature_names)\
    .assign(PRICE=bost.target)\
    .filter(items=['CRIM', 'RM', 'AGE', 'PRICE'])
d.head()
# The "pyplot" module (aliased as "plt" here) has many functions for different kinds of visuals
fig = plt.figure(figsize=(6, 6))
plt.scatter(d['AGE'], d['PRICE']) # Scatter plot example
plt.title('Price vs Age')
plt.xlabel('Age')
plt.ylabel('Price')
# The "plot" function attached to DataFrames and Series is really more like a Pandas plotting library,
# but it's using Matplotlib under the hood
d.plot(kind='scatter', x='AGE', y='PRICE', figsize=(6, 6), title='Price vs Age')
d['PRICE'].plot(kind='kde', title='Price Density Estimate')
d_hist = d.copy()
d_hist[['AGE', 'CRIM']] = d_hist[['AGE', 'CRIM']]\
    .apply(pd.qcut, q=[0, .33, .66, 1], labels=['Low', 'Med', 'High'])
d_hist.head()
# (distplot was deprecated and removed in newer seaborn; histplot(kde=True) is the modern
#  equivalent, and FacetGrid's size= parameter is now height=)
sns.FacetGrid(d_hist, col="CRIM", row="AGE", sharex=True, sharey=False, margin_titles=True, height=4)\
    .map(sns.histplot, 'PRICE', kde=True)
sns.pairplot(d)
Interactive graphics are nice, but interactivity isn't just a frill: it can make visualizations far more practical and lead to a lot less coding -- Plot.ly is a great example of this:
# Create a large dataset with a large number of different timeseries
dates = pd.date_range('2016-01-01', '2016-07-01', freq='10T')
len(dates), dates[:5]
def get_wave_pattern(x):
    n = len(x)
    i = np.arange(n)
    a1 = .1 + np.random.rand(1) * .3
    f1 = 100
    w1 = (i % f1) / float(f1)
    a2 = .1 + np.random.rand(1) * .8
    f2 = 10000
    w2 = (i % f2) / float(f2)
    return a1 * np.sin(w1 * 2 * np.pi) + a2 * np.sin(w2 * 2 * np.pi) + x
np.random.seed(1)
n_ts = 6
d = pd.DataFrame(np.random.randn(len(dates), n_ts) * .3, index=dates).add_prefix('TS').apply(get_wave_pattern)
d.head()
# First, show this data with Matplotlib for comparison
d.plot(figsize=(16, 5), legend=False)
d.iplot()
# Saving html plots to a file
fig = d.iplot(asFigure=True)
plty.offline.plot(fig, filename='/tmp/plotly_test.html')
d.rolling(50, min_periods=1, center=True).mean().iplot()